library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.7      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.0 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.1     ✔ rsample      1.1.0
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.3     ✔ workflows    1.1.0
## ✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.1     ✔ yardstick    1.1.0
## ✔ recipes      1.0.1     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(ISLR)
library(ggplot2)
library(yardstick)
library(ISLR2)
## 
## Attaching package: 'ISLR2'
## 
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-4
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(corrr)
library(corrplot)
## corrplot 0.92 loaded
library(rpart.plot)
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune
library(kknn)
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(janitor)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(dplyr)
library(skimr)
## 
## Attaching package: 'skimr'
## 
## The following object is masked from 'package:corrr':
## 
##     focus
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:scales':
## 
##     alpha
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
tidymodels_prefer(quiet = FALSE)
## [conflicted] Will prefer dplyr::filter over any other package
## [conflicted] Will prefer dplyr::select over any other package
## [conflicted] Will prefer dplyr::slice over any other package
## [conflicted] Will prefer dplyr::rename over any other package
## [conflicted] Will prefer dials::neighbors over any other package
## [conflicted] Will prefer parsnip::fit over any other package
## [conflicted] Will prefer parsnip::bart over any other package
## [conflicted] Will prefer parsnip::pls over any other package
## [conflicted] Will prefer purrr::map over any other package
## [conflicted] Will prefer recipes::step over any other package
## [conflicted] Will prefer themis::step_downsample over any other package
## [conflicted] Will prefer themis::step_upsample over any other package
## [conflicted] Will prefer tune::tune over any other package
## [conflicted] Will prefer yardstick::precision over any other package
## [conflicted] Will prefer yardstick::recall over any other package
## [conflicted] Will prefer yardstick::spec over any other package
## ── Conflicts ──────────────────────────────────────────── tidymodels_prefer() ──
Poke<-read.csv("/Users/Mac/data/Pokemon.csv") 
Poke_cln <- clean_names(Poke)

Introduction

Purpose of the Project

As demand for data science and other data analysis roles grows, the career prospects and salary levels of data scientists have become a topic of wide interest. Data scientists process large amounts of data with modern tools and techniques to discover hidden patterns, extract meaningful information, and support business decisions. Increasingly complex machine learning and data analysis problems have drawn many people with advanced analytical training into the industry. The aim of this project is to give people who want to enter the field an idea of how data scientists have been paid in recent years and which factors have influenced their salaries.

knitr::include_graphics("/Users/Mac/data/DS.png")

Why is this model relevant?

It is mainly about finding out why some data scientists earn higher salaries than others, what a data scientist can do to earn more, and which factors drive these outcomes. By visualizing the relationship between salary and the other variables, we can draw some conclusions.

Loading Data

ds<- read.csv(file= "/Users/Mac/data/ds_salaries.csv")
head(ds)
##   X work_year experience_level employment_type                  job_title
## 1 0      2020               MI              FT             Data Scientist
## 2 1      2020               SE              FT Machine Learning Scientist
## 3 2      2020               SE              FT          Big Data Engineer
## 4 3      2020               MI              FT       Product Data Analyst
## 5 4      2020               SE              FT  Machine Learning Engineer
## 6 5      2020               EN              FT               Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  70000             EUR         79833                 DE            0
## 2 260000             USD        260000                 JP            0
## 3  85000             GBP        109024                 GB           50
## 4  20000             USD         20000                 HN            0
## 5 150000             USD        150000                 US           50
## 6  72000             USD         72000                 US          100
##   company_location company_size
## 1               DE            L
## 2               JP            S
## 3               GB            M
## 4               HN            S
## 5               US            L
## 6               US            L
dim(ds) 
## [1] 607  12
# 607 observations (rows) and 12 variables (columns)

Data Packages

The Data Science Job Salaries dataset contains 607 observations and 12 columns: a row index (X) plus the following 11 variables:

  • work_year: The year the salary was paid.

  • experience_level: The experience level in the job during the year

    • EN = Entry-level / Junior;
    • MI = Mid-level / Intermediate;
    • SE = Senior-level / Expert;
    • EX = Executive-level / Director

  • employment_type: The type of employment for the role

    • PT = Part-time;
    • FT = Full-time;
    • CT = Contract;
    • FL = Freelance;
  • job_title: The role worked in during the year.

  • salary : The total gross salary amount paid.

  • salary_currency: The currency of the salary paid as an ISO 4217 currency code.

  • salary_in_usd: The salary in USD

  • employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code.

  • remote_ratio: The overall amount of work done remotely

    • 0 = No remote work (less than 20%);
    • 50 = Partially remote;
    • 100 = Fully remote (more than 80%)
  • company_location: The country of the employer's main office or contracting branch

  • company_size : The median number of people that worked for the company during the year

    • S = less than 50 employees (small);
    • M = 50 to 250 employees (medium);
    • L = more than 250 employees (large)
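
Several of these codes have a natural order (EN < MI < SE < EX and S < M < L) that alphabetical sorting ignores. A minimal sketch, using a made-up toy data frame rather than the real file, of encoding them as ordered factors so that plots and summaries follow that order:

```r
# Toy rows that use the codes described above (not the real data)
toy <- data.frame(
  experience_level = c("SE", "EN", "EX", "MI"),
  company_size     = c("L", "S", "M", "M")
)

# Ordered factors keep the natural ranking instead of alphabetical order
toy$experience_level <- factor(toy$experience_level,
                               levels = c("EN", "MI", "SE", "EX"),
                               ordered = TRUE)
toy$company_size <- factor(toy$company_size,
                           levels = c("S", "M", "L"),
                           ordered = TRUE)

levels(toy$experience_level)
# Comparisons now respect the ranking: SE (row 1) is above EN (row 2)
toy$experience_level[1] > toy$experience_level[2]
```

Whether the models benefit from ordered factors is a separate question; for the EDA plots the main effect would be that the x-axis categories appear in rank order.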

Exploratory Data Analysis

Although the downloaded data set is already tidy, a few cleaning steps are needed before modeling.

Clean Data

Clean Names

ds_cln <- ds %>%
  clean_names() 

Data Summary

Get a general overview of the dataset.

names(ds_cln)
##  [1] "x"                  "work_year"          "experience_level"  
##  [4] "employment_type"    "job_title"          "salary"            
##  [7] "salary_currency"    "salary_in_usd"      "employee_residence"
## [10] "remote_ratio"       "company_location"   "company_size"
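class(ds_cln$Feature) # ds_cln has no column named Feature, so this returns "NULL"; str() below shows each column's class
## [1] "NULL"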
str(ds_cln)
## 'data.frame':    607 obs. of  12 variables:
##  $ x                 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ work_year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : chr  "MI" "SE" "SE" "MI" ...
##  $ employment_type   : chr  "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr  "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
##  $ salary            : int  70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
##  $ salary_currency   : chr  "EUR" "USD" "GBP" "USD" ...
##  $ salary_in_usd     : int  79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
##  $ employee_residence: chr  "DE" "JP" "GB" "HN" ...
##  $ remote_ratio      : int  0 0 50 0 50 100 100 50 100 50 ...
##  $ company_location  : chr  "DE" "JP" "GB" "HN" ...
##  $ company_size      : chr  "L" "S" "M" "S" ...
summary(ds_cln)
##        x           work_year    experience_level   employment_type   
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607        
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character  
##  Median :303.0   Median :2022   Mode  :character   Mode  :character  
##  Mean   :303.0   Mean   :2021                                        
##  3rd Qu.:454.5   3rd Qu.:2022                                        
##  Max.   :606.0   Max.   :2022                                        
##   job_title             salary         salary_currency    salary_in_usd   
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859  
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726  
##  Mode  :character   Median :  115000   Mode  :character   Median :101570  
##                     Mean   :  324000                      Mean   :112298  
##                     3rd Qu.:  165000                      3rd Qu.:150000  
##                     Max.   :30400000                      Max.   :600000  
##  employee_residence  remote_ratio    company_location   company_size      
##  Length:607         Min.   :  0.00   Length:607         Length:607        
##  Class :character   1st Qu.: 50.00   Class :character   Class :character  
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character  
##                     Mean   : 70.92                                        
##                     3rd Qu.:100.00                                        
##                     Max.   :100.00

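Since the salary values are large, I divide salary_in_usd by 1,000 so that salaries are expressed in thousands of USD, which makes the EDA plots easier to read. (Despite its name, usd_salary_subtract_thousand holds the salary divided by 1,000, not reduced by 1,000.)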

ds_usd<- mutate(ds_cln, usd_salary_subtract_thousand=salary_in_usd/1000) 
head(ds_usd)
##   x work_year experience_level employment_type                  job_title
## 1 0      2020               MI              FT             Data Scientist
## 2 1      2020               SE              FT Machine Learning Scientist
## 3 2      2020               SE              FT          Big Data Engineer
## 4 3      2020               MI              FT       Product Data Analyst
## 5 4      2020               SE              FT  Machine Learning Engineer
## 6 5      2020               EN              FT               Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  70000             EUR         79833                 DE            0
## 2 260000             USD        260000                 JP            0
## 3  85000             GBP        109024                 GB           50
## 4  20000             USD         20000                 HN            0
## 5 150000             USD        150000                 US           50
## 6  72000             USD         72000                 US          100
##   company_location company_size usd_salary_subtract_thousand
## 1               DE            L                       79.833
## 2               JP            S                      260.000
## 3               GB            M                      109.024
## 4               HN            S                       20.000
## 5               US            L                      150.000
## 6               US            L                       72.000

Check missing value

sum(is.na(ds_cln)) 
## [1] 0
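
The single total above does not say where missing values would sit if there were any. A small base-R sketch on a made-up data frame (the values are illustrative only) that counts missing values per column and the number of complete rows:

```r
# Toy data with two deliberately missing entries (not the real data)
toy <- data.frame(
  salary_in_usd = c(79833, NA, 109024),
  company_size  = c("L", "S", NA)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Number of rows without any NA
complete_rows <- sum(complete.cases(toy))
complete_rows
```

Note that is.na() does not flag empty strings, so a character column read from CSV may still hide blanks that need a separate check.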

Visual EDA

Correlation

To look at correlations among the continuous variables, we compute the correlation matrix with cor() and visualize it with corrplot() from the corrplot package. We keep only the numeric columns work_year, salary, salary_in_usd, and remote_ratio; the remaining variables (experience_level, employment_type, job_title, salary_currency, employee_residence, company_location, and company_size) are left out because they are not numeric.

ds_cln%>% 
  select(c(work_year , salary, salary_in_usd, remote_ratio)) %>% 
  select_if(is.numeric) %>% 
  cor() %>% 
  corrplot(method = 'number', diag = F, type = 'upper', bg = 'blue')

According to the correlation matrix, salary_in_usd is most strongly correlated with work_year.
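
The corrr package offers another route to a correlation matrix through correlate() and rplot(). The sketch below is only an illustration: it uses the first six rows of three numeric columns copied from the head(ds) output, and work_year is left out because it is constant in those rows.

```r
library(dplyr)
library(corrr)

# First six rows of three numeric columns, copied from the head(ds) output
toy <- data.frame(
  salary        = c(70000, 260000, 85000, 20000, 150000, 72000),
  salary_in_usd = c(79833, 260000, 109024, 20000, 150000, 72000),
  remote_ratio  = c(0, 0, 50, 0, 50, 100)
)

cor_tbl <- toy %>%
  select(where(is.numeric)) %>%  # keep numeric columns only
  correlate()                    # Pearson correlations; the diagonal is NA

cor_tbl
rplot(cor_tbl)                   # dot plot of the correlation matrix
```

With only six rows these correlations say nothing about the full data; the point is the shape of the workflow.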

Company Location

ds_cln%>% 
 group_by(company_location)%>%
 ggplot(aes(x = forcats::fct_infreq(company_location))) + 
 geom_bar() +
 coord_flip()

The graph above shows that most of the companies in this dataset are located in the United States.

Salary distribution

ds_his <- ggplot(data = ds_usd,
    mapping = aes(x = usd_salary_subtract_thousand))
ds_his + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram shows that most data scientists' salaries fall between $10k and $180k, and only a small number of workers earn more than $200k.
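
The stat_bin() message above asks for an explicit bin width. A minimal sketch, assuming a width of 20 (thousand USD) is a reasonable choice and using the six salaries from the head(ds_usd) output as toy data:

```r
library(ggplot2)

# Salaries in thousands of USD, copied from the head(ds_usd) output
toy <- data.frame(salary_k = c(79.833, 260, 109.024, 20, 150, 72))

# binwidth is an assumption; boundary = 0 aligns bin edges to multiples of 20
p <- ggplot(toy, aes(x = salary_k)) +
  geom_histogram(binwidth = 20, boundary = 0) +
  labs(x = "Salary (thousands of USD)", y = "Count")

p
```

Fixing the bin width makes the bin edges readable (0, 20, 40, ...) and removes the message.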

Data Scientist Salary by Work Year

ggplot(data = ds_usd, aes(factor(work_year), usd_salary_subtract_thousand)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.1) +
  xlab("Year")+labs(title = "Data Science Salary by year")

ggplot(data = ds_usd, aes(factor(work_year), usd_salary_subtract_thousand, fill=employment_type)) + geom_col(position = "dodge")

These two plots suggest that more and more people earned higher salaries from 2020 to 2022. Full-time workers generally earn more than part-time or contract workers, although contract workers show a sudden salary increase in 2021. One possible explanation is Covid-19, which led many people to work from home.
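
One caveat when reading the dodged bar charts: geom_col() draws one bar per row, and bars that share the same year and employment type overlap, so the visible height is the largest salary in the group rather than the average. A hedged sketch, on made-up values, of summarising to group means first so that each bar shows an average:

```r
library(dplyr)
library(ggplot2)

# Made-up salaries in thousands of USD (not the real data)
toy <- data.frame(
  work_year       = c(2020, 2020, 2021, 2021, 2021, 2022),
  employment_type = c("FT", "FT", "FT", "CT", "CT", "FT"),
  salary_k        = c(80, 120, 100, 30, 400, 130)
)

# One row per (year, employment type) with the mean salary and group size
avg_salary <- toy %>%
  group_by(work_year, employment_type) %>%
  summarise(mean_salary_k = mean(salary_k), n = n(), .groups = "drop")

avg_salary

p <- ggplot(avg_salary,
            aes(factor(work_year), mean_salary_k, fill = employment_type)) +
  geom_col(position = "dodge") +
  labs(x = "Year", y = "Mean salary (thousands of USD)")

p
```

Keeping the group size n next to the mean helps flag groups, such as contract workers in a single year, where one or two observations drive the bar.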

Data Science Salary by experience

ggplot(data = ds_usd, aes(factor(experience_level), usd_salary_subtract_thousand)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.1) +
  xlab("experience_level")+ 
  labs(title = "Data Science Salary by experience")

ggplot(data = ds_usd, aes(factor(experience_level), usd_salary_subtract_thousand,fill = employment_type))  + geom_col(position = "dodge")

The first graph shows that experience level is directly related to salary: the executive level has the highest salaries, and the higher the level, the higher the salary. The second chart shows that executive-level employees are all contract or full-time workers, and that freelance (FL) work appears only at the mid and senior levels.

Data Science Salary by Remote Ratio

ggplot(data = ds_usd, aes(factor(remote_ratio), usd_salary_subtract_thousand)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.1) +
  xlab("remote ratio")+labs(title = "Data Science Salary by remote ratio")

This graph does not show a clear relationship between remote ratio and salary. However, workers who are fully on-site (0) or fully remote (100) tend to earn more than those who are partially remote (50).

Data Science Salary by Company Size

ggplot(data = ds_usd, aes(factor(company_size), usd_salary_subtract_thousand)) +
  geom_boxplot() + 
  geom_jitter(alpha = 0.1) +
  xlab("company size")+labs(title = "Data Science Salary by company size")

ggplot(data = ds_usd, aes(factor(company_size), usd_salary_subtract_thousand,fill = employment_type))  + geom_col(position = "dodge")

Salaries at medium-sized companies are relatively concentrated, and medium-sized companies pay slightly more on average than large companies. In addition, the bigger the company, the more full-time employees it has.

Data Science Salary by Employee Residence

ggplot(data = ds_usd, aes(x = usd_salary_subtract_thousand, y = employee_residence)) + 
  geom_boxplot() +
  theme_bw() +
  labs(x = "usd_salary (thousands)", y = "employee_residence") 

As can be clearly seen from the chart, data scientists in the United States have a higher average salary.

Model Preparation

Initial Split

The data were split into 80% training and 20% testing sets, stratified on salary_in_usd. Setting a seed makes the split reproducible, so we get the same result every time.

set.seed(2022)
ds_split <- initial_split(ds, prop = 0.80, strata = salary_in_usd) #use stratified sampling
ds_train <- training(ds_split)
ds_test <- testing(ds_split)
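
A quick way to sanity-check a stratified split is to confirm that the two parts do not overlap and that the outcome has a similar distribution in each. The sketch below uses simulated salaries (an assumption, not the real data) with the same rsample calls as above:

```r
library(rsample)

set.seed(2022)

# Simulated right-skewed salaries; `id` lets us check for overlap
toy <- data.frame(
  id            = 1:200,
  salary_in_usd = round(exp(rnorm(200, mean = 11.5, sd = 0.5)))
)

toy_split <- initial_split(toy, prop = 0.80, strata = salary_in_usd)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Sizes of the two parts
c(train = nrow(toy_train), test = nrow(toy_test))

# With stratification the quartiles should be broadly similar
summary(toy_train$salary_in_usd)
summary(toy_test$salary_in_usd)
```

For a numeric strata variable, rsample bins the values into quantile groups before sampling, so the match between the two summaries is approximate rather than exact.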

Create Recipe

Because all models use the same predictors, preprocessing, and response variable, we create one central recipe for all of them to work with. The recipe uses every other column as a predictor.

simple_ds_recipe <- recipe(salary_in_usd ~ ., data = ds_train) 
simple_ds_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         11
ds_recipe <- recipe(salary_in_usd ~ ., data = ds_train) %>% 
  step_dummy(all_nominal_predictors())
# there is one outcome and 11 predictors
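
Two things about a recipe of this shape are worth a hedged illustration. First, step_dummy() alone yields missing values, with a "new levels in a factor" warning, when it meets a category that was absent from the training data; step_novel() reserves a level for such values. Second, a column such as salary, which is the outcome in another currency, may leak the answer. The sketch below runs on made-up rows and is one possible variant, not the recipe used in this report:

```r
library(recipes)

# Made-up training rows; `salary` stands in for a column that may leak the outcome
train <- data.frame(
  salary_in_usd = c(79833, 109024, 72000, 150000, 20000),
  salary        = c(70000, 85000, 72000, 150000, 20000),
  job_title     = c("Data Scientist", "Data Scientist", "Data Analyst",
                    "Data Analyst", "ML Engineer"),
  remote_ratio  = c(0, 50, 100, 50, 0)
)

# A new row whose job title never appears in the training rows
new_rows <- data.frame(
  salary_in_usd = 95000,
  salary        = 95000,
  job_title     = "NLP Engineer",
  remote_ratio  = 100
)

robust_recipe <- recipe(salary_in_usd ~ ., data = train) %>%
  step_rm(salary) %>%                       # drop the potentially leaky predictor
  step_novel(all_nominal_predictors()) %>%  # reserve a "new" level for unseen values
  step_dummy(all_nominal_predictors()) %>%  # indicator columns for non-reference levels
  step_zv(all_predictors())                 # drop columns that are constant in training

prepped   <- prep(robust_recipe, training = train)
baked_new <- bake(prepped, new_data = new_rows)
baked_new
```

In this variant the unseen job title is encoded as all zeros in the remaining indicator columns instead of producing missing values. Whether dropping salary (and, in the real data, salary_currency and the index column) is appropriate depends on the goal of the analysis.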

Cross-Validation

We will use stratified cross-validation so that each fold has a similar distribution of the response. We fold the training set with v-fold cross-validation, setting v = 5 and stratifying on the response variable salary_in_usd.

# Fold the training set using v-fold cross-validation, with 'v = 5'. Stratify on the outcome variable.
ds_folds <- vfold_cv(ds_train, v = 5, strata = salary_in_usd)

Models

We will now see which model performs best under 5-fold cross-validation. We compare the following models: Linear Regression, Elastic Net, Decision Tree, and Boosted Tree.

Linear Regression

lm_model <- linear_reg() %>% 
  set_engine("lm")
lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ds_recipe)

Next, we fit the linear model to the training set.

lm_fit <- fit(lm_wflow, ds_train)

Then we review the model results.

lm_fit %>% 
  # This returns the parsnip object:
  extract_fit_parsnip() %>% 
  # Now tidy the linear model object:
  tidy()
## # A tibble: 173 × 5
##    term                     estimate std.error statistic       p.value
##    <chr>                       <dbl>     <dbl>     <dbl>         <dbl>
##  1 (Intercept)         36498320.       1.42e+7     2.58  0.0104       
##  2 X                         24.5      2.60e+1     0.943 0.346        
##  3 work_year             -18060.       7.01e+3    -2.58  0.0104       
##  4 salary                     0.0204   8.03e-3     2.54  0.0114       
##  5 remote_ratio              76.2      5.59e+1     1.36  0.174        
##  6 experience_level_EX    94968.       1.63e+4     5.81  0.0000000141 
##  7 experience_level_MI    28617.       7.68e+3     3.72  0.000229     
##  8 experience_level_SE    49404.       8.15e+3     6.06  0.00000000349
##  9 employment_type_FL    -97333.       5.57e+4    -1.75  0.0813       
## 10 employment_type_FT    -88023.       2.87e+4    -3.07  0.00233      
## # … with 163 more rows
ds_train_res <- predict(lm_fit, new_data = ds_train %>% select(-salary_in_usd))
## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading
ds_train_res %>% 
  head()
## # A tibble: 6 × 1
##    .pred
##    <dbl>
## 1 20000.
## 2 35735.
## 3 16297.
## 4 64503.
## 5 42271.
## 6 23044.
ds_train_res <- bind_cols(ds_train_res, ds_train %>% select(salary_in_usd))
ds_train_res %>% 
  head()
## # A tibble: 6 × 2
##    .pred salary_in_usd
##    <dbl>         <int>
## 1 20000.         20000
## 2 35735.         35735
## 3 16297.         51321
## 4 64503.         40481
## 5 42271.         39916
## 6 23044.          8000
ds_train_res %>% 
  ggplot(aes(x = .pred, y = salary_in_usd)) +
  geom_point(alpha = 0.6) +
  geom_abline(lty = 2) + 
  theme_bw() +
  coord_obs_pred()

# Reuse the linear regression workflow defined above and evaluate it with 5-fold cross-validation
lm_cv <- fit_resamples(lm_wflow, resamples = ds_folds)
## ! Fold1: preprocessor 1/1, model 1/1 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold2: preprocessor 1/1, model 1/1 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold3: preprocessor 1/1, model 1/1 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold4: preprocessor 1/1, model 1/1 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold5: preprocessor 1/1, model 1/1 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
collect_metrics(lm_cv)
## # A tibble: 2 × 6
##   .metric .estimator      mean     n   std_err .config             
##   <chr>   <chr>          <dbl> <int>     <dbl> <chr>               
## 1 rmse    standard   72992.        5 8108.     Preprocessor1_Model1
## 2 rsq     standard       0.291     5    0.0411 Preprocessor1_Model1

This is my linear regression model. The plot of predicted against observed values on the training set shows a positive relationship between the two. However, the cross-validated rsq is 0.291, which means the model explains only about 29.1% of the variability in salary_in_usd, so linear regression is not the best model for these data.
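
To make the two metrics concrete, here is a base-R sketch that computes them by hand from the six prediction and truth pairs printed above. yardstick's rsq() is the squared correlation between truth and prediction, while rsq_trad() is the traditional coefficient of determination; the two can differ, and only the traditional form can go below zero.

```r
# Six (prediction, truth) pairs copied from the ds_train_res output above
pred  <- c(20000, 35735, 16297, 64503, 42271, 23044)
truth <- c(20000, 35735, 51321, 40481, 39916, 8000)

# Root mean squared error, in the same units as the salary
rmse_manual <- sqrt(mean((truth - pred)^2))

# Squared correlation: the definition behind yardstick::rsq()
rsq_cor <- cor(truth, pred)^2

# Traditional R^2 = 1 - SS_res / SS_tot: the definition behind yardstick::rsq_trad()
rsq_trad <- 1 - sum((truth - pred)^2) / sum((truth - mean(truth))^2)

c(rmse = rmse_manual, rsq = rsq_cor, rsq_trad = rsq_trad)
```

Six training rows are far too few to judge the model; the cross-validated figures from collect_metrics() remain the ones to report.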

Elastic Net Tuning

elastic_net <-linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet")

elastic_net_workflow <- workflow() %>% 
  add_recipe(ds_recipe) %>% 
  add_model(elastic_net)

elastic_net_grid <- grid_regular(penalty(range = c(-10, 10)), mixture(range = c(0,1)), levels = 10)

tune_res <- tune_grid(elastic_net_workflow,resamples = ds_folds, grid = elastic_net_grid)
## ! Fold1: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold2: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold3: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold4: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: internal: A correlation computation is required, but `estimate` is constant and ha...
## ! Fold5: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: internal: A correlation computation is required, but `estimate` is constant and ha...
autoplot(tune_res)

collect_metrics(tune_res)
## # A tibble: 200 × 8
##         penalty mixture .metric .estimator      mean     n   std_err .config    
##           <dbl>   <dbl> <chr>   <chr>          <dbl> <int>     <dbl> <chr>      
##  1 0.0000000001       0 rmse    standard   51891.        5 5133.     Preprocess…
##  2 0.0000000001       0 rsq     standard       0.475     5    0.0550 Preprocess…
##  3 0.0000000167       0 rmse    standard   51891.        5 5133.     Preprocess…
##  4 0.0000000167       0 rsq     standard       0.475     5    0.0550 Preprocess…
##  5 0.00000278         0 rmse    standard   51891.        5 5133.     Preprocess…
##  6 0.00000278         0 rsq     standard       0.475     5    0.0550 Preprocess…
##  7 0.000464           0 rmse    standard   51891.        5 5133.     Preprocess…
##  8 0.000464           0 rsq     standard       0.475     5    0.0550 Preprocess…
##  9 0.0774             0 rmse    standard   51891.        5 5133.     Preprocess…
## 10 0.0774             0 rsq     standard       0.475     5    0.0550 Preprocess…
## # … with 190 more rows

The best combination of penalty and mixture can be selected with select_best(), here according to rsq:

best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
## # A tibble: 1 × 3
##   penalty mixture .config               
##     <dbl>   <dbl> <chr>                 
## 1   2154.   0.556 Preprocessor1_Model057
# Finalize the workflow with the best parameters (an elastic net, despite the `ridge_` prefix), refit it on the training set, and evaluate it on the test set
ridge_final <- finalize_workflow(elastic_net_workflow, best_penalty)

ridge_final_fit <- fit(ridge_final, data = ds_train)

augment(ridge_final_fit, new_data = ds_test) %>%
  rsq(truth = salary_in_usd, estimate = .pred)
## Warning: There are new levels in a factor: Cloud Data Engineer, ETL Developer,
## Lead Machine Learning Engineer
## Warning: There are new levels in a factor: CN, HR, RS, MY
## Warning: There are new levels in a factor: HR, SG, MY
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.448

The plot shows that rmse increases as the penalty (the amount of regularization) and the mixture grow, while rsq mostly rises at first and then falls. On the test set, the tuned model reaches an rsq of 0.448, so it explains only about 44.8% of the variability in salary_in_usd. Therefore, we conclude that although the tuned elastic net fits these data better than linear regression, it is still not a good fit.
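
The range passed to penalty() is on the log10 scale, so c(-10, 10) spans penalties from 1e-10 to 1e10, and levels = 10 places ten values per parameter at equal steps on that scale. A base-R sketch that rebuilds an equivalent grid without dials, as an illustration rather than the exact object tune_grid() received:

```r
# Ten penalties equally spaced on the log10 scale, from 1e-10 to 1e10
penalty_values <- 10^seq(-10, 10, length.out = 10)

# Ten mixtures from 0 (ridge) to 1 (lasso)
mixture_values <- seq(0, 1, length.out = 10)

# Every combination of the two: 10 x 10 candidate models
grid <- expand.grid(penalty = penalty_values, mixture = mixture_values)

nrow(grid)
signif(penalty_values, 3)
round(mixture_values, 3)
```

The 100 combinations, each scored on two metrics, account for the 200 rows returned by collect_metrics(), and the selected values (penalty of about 2154, mixture of about 0.556) are the seventh penalty and the sixth mixture on this grid.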

Decision Tree

set.seed(2022)

tree_spec <- decision_tree() %>%
  set_engine("rpart")

class_tree_spec <- tree_spec %>%
  set_mode("regression")


class_tree_wf <- workflow() %>%
add_model(class_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_recipe(ds_recipe)


param_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)

tune_res <- tune_grid(class_tree_wf, 
  resamples = ds_folds, 
  grid = param_grid, 
  metrics = metric_set(rsq))
## ! Fold1: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold2: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold3: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold4: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold5: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
autoplot(tune_res)

arrange(collect_metrics(tune_res), desc(mean))
## # A tibble: 10 × 7
##    cost_complexity .metric .estimator  mean     n std_err .config              
##              <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1         0.001   rsq     standard   0.806     5  0.0599 Preprocessor1_Model01
##  2         0.00167 rsq     standard   0.804     5  0.0608 Preprocessor1_Model02
##  3         0.00278 rsq     standard   0.799     5  0.0609 Preprocessor1_Model03
##  4         0.00464 rsq     standard   0.786     5  0.0601 Preprocessor1_Model04
##  5         0.00774 rsq     standard   0.777     5  0.0623 Preprocessor1_Model05
##  6         0.0129  rsq     standard   0.768     5  0.0623 Preprocessor1_Model06
##  7         0.0215  rsq     standard   0.742     5  0.0634 Preprocessor1_Model07
##  8         0.0359  rsq     standard   0.721     5  0.0612 Preprocessor1_Model08
##  9         0.0599  rsq     standard   0.628     5  0.0603 Preprocessor1_Model09
## 10         0.1     rsq     standard   0.571     5  0.0661 Preprocessor1_Model10
best_complexity <- select_best(tune_res, metric = "rsq")

class_tree_final <- finalize_workflow(class_tree_wf, best_complexity)

class_tree_final_fit <- fit(class_tree_final, data = ds_train)
class_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call rpart.plot with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

head(arrange(collect_metrics(tune_res), desc(mean)),1)
## # A tibble: 1 × 7
##   cost_complexity .metric .estimator  mean     n std_err .config              
##             <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1           0.001 rsq     standard   0.806     5  0.0599 Preprocessor1_Model01

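Thus, from the table above, the best cross-validated R^2 is 0.806 (about 80.6%), obtained at cost_complexity = 0.001, which is a reasonably good fit for our data. This is better than both linear regression and the tuned elastic net.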
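The cost_complexity values in the table above follow from the tuning grid. cost_complexity(range = c(-3, -1)) is given on a log10 scale, so grid_regular() with levels = 10 appears to place the candidates evenly in the exponent. The sketch below reproduces them in base R under that assumption (the names exponents and cp_values are introduced here for illustration only).

```r
# The range c(-3, -1) is in log10 units, so a regular grid with 10 levels
# is evenly spaced in the exponent, not in cost_complexity itself.
exponents <- seq(-3, -1, length.out = 10)
cp_values <- 10^exponents

# Round to three significant digits to compare with the printed table.
print(signif(cp_values, 3))
```

Because the spacing is logarithmic, most candidates sit near the low end of the range. The best value, 0.001, is at the boundary of the grid, so extending the range downward might be worth trying.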

Boosted tree

boost_spec <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")

boost_wf <- workflow() %>%
  add_model(boost_spec %>% set_args(trees = tune(), tree_depth = tune())) %>%
  add_recipe(ds_recipe)

boost_grid <- grid_regular(trees(range = c(20,1000)),
                           tree_depth(range = c(1,10)), 
                           levels = 10)

boost_tune_res <- tune_grid(
  boost_wf,
  resamples = ds_folds,
  grid = boost_grid,
  metrics = metric_set(rsq)
)      
## ! Fold1: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold1: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Finance Data Analyst, NLP Engineer, Ma...
## ! Fold2: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold2: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Big Data Architect, Head of Machine Le...
## ! Fold3: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold3: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: 3D Computer Vision Researcher, There a...
## ! Fold4: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold4: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: CLP, There are new levels in a factor:...
## ! Fold5: preprocessor 1/1, model 1/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 2/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 3/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 4/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 5/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 6/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 7/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 8/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 9/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
## ! Fold5: preprocessor 1/1, model 10/10 (predictions): There are new levels in a factor: Staff Data Scientist, Data Analytics L...
autoplot(boost_tune_res)

From this figure, we can see that for most tree depths rsq does not change as the number of trees increases, while for a few depths it rises with the number of trees.

finally_res <- arrange(collect_metrics(boost_tune_res), desc(mean))
head(finally_res)
## # A tibble: 6 × 8
##   trees tree_depth .metric .estimator  mean     n std_err .config               
##   <int>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1  1000          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model040
## 2   782          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model038
## 3   891          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model039
## 4   673          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model037
## 5   564          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model036
## 6   455          4 rsq     standard   0.954     5  0.0114 Preprocessor1_Model035

Thus, we can see that this model's R^2 is about 0.954 (nearly 95%), the highest among the models we tried. The boosted tree is therefore the model we should choose.
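The six rows printed above all share the same mean rsq to three decimals, so the number of trees seems to matter little once tree_depth is 4. One possible way to break such ties is to prefer the smallest model. The sketch below does this in base R on the six rows copied by hand from the output. The tolerance tol and the names top_rows, tied, and simplest are assumptions made for this illustration, and rows not shown in the output may tie as well.

```r
# The six rows printed above, copied by hand from the collect_metrics() output.
top_rows <- data.frame(
  trees      = c(1000, 782, 891, 673, 564, 455),
  tree_depth = c(4, 4, 4, 4, 4, 4),
  mean       = c(0.954, 0.954, 0.954, 0.954, 0.954, 0.954)
)

# Keep rows whose mean rsq is within a small tolerance of the best one
# (the tolerance is an arbitrary choice for this sketch).
tol <- 1e-3
tied <- top_rows[top_rows$mean >= max(top_rows$mean) - tol, ]

# Among the tied rows, prefer the one with the fewest trees.
simplest <- tied[which.min(tied$trees), ]
print(simplest)
```

On the full tuning results, tune's select_by_one_std_err() offers a related, more principled way to favor simpler models.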

Conclusion

  • Data science jobs are becoming more popular. Not only is the number of jobs increasing, but the average salary is also rising every year.

  • Based on salary alone, an employee who wants the highest possible salary should choose the United States.

  • Large and medium-sized companies pay higher wages than small companies.

  • Most people are employed full time, and the wages of full-time employees are significantly higher than those of part-time and contract workers.

  • Data engineer, data scientist, and machine learning engineer are the most valuable titles (based on their average salaries).

  • Salary increases with the number of years worked in this field. Staying in the industry longer, gaining experience, and moving up tends to bring a substantial pay increase.

  • Most of the data comes from the United States, and the US pays much higher wages than other countries.

  • The best-fitting model for this dataset is the boosted tree, with a cross-validated R^2 of about 95% at a tree depth of 4.